The goal of this assignment is to introduce you to R, RStudio, Git, and GitHub, which you’ll be using throughout the course both to learn the data science concepts discussed in the course and to analyze real data and come to informed conclusions.
By the end of this assignment, you will be able to:
Don’t worry if these terms are unfamiliar! We’ll walk through each step carefully.
This assignment assumes that you have reviewed the lectures titled “Meet the toolkit: Programming” and “Meet the toolkit: version control and collaboration”. If you haven’t yet done so, please pause and complete the following before continuing.
We’ve already thrown around a few new terms, so let’s define them before we proceed.
R: Name of the programming language we will be using throughout the course.
RStudio: An integrated development environment for R. In other words, a convenient interface for writing and running R code.
Git: A version control system.
GitHub: A web platform for hosting version controlled files and facilitating collaboration among users.
Repository: A Git repository contains all of your project’s files and stores each file’s revision history. It’s common to refer to a repository as a repo.
As the course progresses, you are encouraged to explore beyond what the assignments dictate; a willingness to experiment will make you a much better programmer! Before we get to that stage, however, you need to build some basic fluency in R. First, we will explore the fundamental building blocks of all of these tools.
Before you can get started with the analysis, you need to make sure you:
have a GitHub account
are a member of the course GitHub organization
have successfully logged in and authenticated in the JupyterHub
If you failed to confirm any of these, it means you have not yet completed the prerequisites for this assignment. Please go back to Prerequisites and complete them before continuing the assignment.
IMPORTANT: If there is no GitHub repo created for you for this assignment, it means I didn’t have your GitHub username when I assigned the homework. Please let me know your GitHub username as soon as possible, and I can create your repo.
For each assignment in this course you will start with a GitHub repo that I created for you and that contains the starter documents you will build upon when working on your assignment. The first step is always to bring these files into RStudio so that you can edit them, run them, view your results, and interpret them. This action is called cloning.
Then you will work in RStudio on the data analysis, making commits along the way (snapshots of your changes) and finally push all your work back to GitHub.
The next few steps will walk you through getting the repo information needed for cloning, cloning your repo into a new RStudio project, and getting started with the analysis.
On GitHub, click on the green Code button, select HTTPS (this might already be selected by default, and if it is, you’ll see the text Use Git or checkout with SVN using the web URL as in the image on the right). Click on the clipboard icon 📋 to copy the repo URL.
In RStudio, click on the down arrow next to New Project and then choose New Project from Git Repository.
In the pop-up window, paste the URL you copied from GitHub, make sure the box for Add packages from the base project is checked (it should be, by default) and then click OK.
✅ Checkpoint: You should now see your project files in the Files pane (bottom right). If you don’t see a file called Homework Instructions, something went wrong; ask for help before continuing.
RStudio is made up of four panes.
Console: Type 2 + 2 here and hit enter. What do you get?
Environment: Type x <- 2 in the Console and hit enter. What do you get in the Environment pane? Importantly, this pane is also where the Git interface lives. We will be using that regularly throughout this assignment.
Before we introduce the data, let’s warm up with some simple exercises.
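The two Console prompts from the pane tour can be tried as follows. This is a minimal sketch of what you would type at the Console, one line at a time; the comments describe what R is expected to do.

```r
# Typed at the Console prompt, one line at a time
2 + 2     # R evaluates the expression and prints the result
x <- 2    # assignment prints nothing, but x now appears in the Environment pane
x         # typing the name of an object prints its value
```

Assignment with <- stores a value without printing it, which is why the second line changes the Environment pane rather than the Console output.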
The top portion of your R Markdown file (between the three dashed lines) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human-friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
Then go to the Git pane in RStudio.
You should see that your Rmd (R Markdown) file and its output, your md file (Markdown), are listed there as recently changed files.
Next, click on Diff. This will pop open a new window that shows you the difference between the last committed state of the document and its current state that includes your changes. If you’re happy with these changes, click on the checkboxes of all files in the list, type “Update author name” in the Commit message box, and hit Commit.
You don’t have to commit after every change; this would get quite cumbersome. You should consider committing states that are meaningful to you for inspection, comparison, or restoration. In the first few assignments we will tell you exactly when to commit and, in some cases, what commit message to use. As the semester progresses we will let you make these decisions.
Now that you have made an update and committed this change, it’s time to push these changes to the web! Or more specifically, to your repo on GitHub. Why? So that others can see your changes. And by others, we mean the course teaching team (your repos in this course are private to you and us, only). In order to push your changes to GitHub, click on Push.
This will prompt a dialogue box where you may need to authenticate with GitHub. Follow the prompts to complete the authentication process.
Note: The first time you push, you may need to set up authentication. If you encounter issues, refer to the authentication guide posted on Canvas or ask for help during office hours.
Thought exercise: Which of the above steps (updating the YAML, committing, and pushing) needs to talk to GitHub? Only pushing requires talking to GitHub; this is why you’re asked for your password at that point.
✅ Checkpoint: After pushing, go to your GitHub repo in your web browser and refresh the page. You should see your updated file with your name in it. If you don’t see the changes, try pushing again or ask for help.
R is an open-source language, and developers contribute functionality to R via packages. In this assignment we will use two packages: tidyverse and openintro.
We use the library() function to load packages. In your R Markdown document you should see an R chunk labelled load-packages, which contains the code for loading both packages. You should also load these packages in your Console by clicking on the Run Current Chunk icon (the green arrow pointing right), which sends the chunk’s code to the Console.
Note that these packages also get loaded in your R Markdown environment when you Knit your R Markdown document.
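As a sketch, the load-packages chunk is assumed to contain the two library() calls below. The exact contents of the chunk in your starter file may differ, and both packages are assumed to be installed already.

```r
# Assumed contents of the load-packages chunk
library(tidyverse)  # collection of data analysis packages, including dplyr and the %>% pipe
library(openintro)  # provides the seattlepets dataset used in this assignment
```

Running these lines in the Console makes the same functions and data available there as in the knitted document.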
✅ Checkpoint: If the packages loaded successfully, you should see no error messages in red. If you see an error like “there is no package called…”, let your instructor know.
The city of Seattle, WA has an open data portal that includes pets registered in the city. For each registered pet, we have information on the pet’s name and species. The data used in this exercise can be found in the openintro package, and it’s called seattlepets. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package.
You can view the dataset as a spreadsheet using the View() function. Note that you should not put this function in your R Markdown document, but instead type it directly in the Console, as it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…). When you run this in the Console, you’ll see the following data viewer window pop up.
You can find out more about the dataset by inspecting its documentation (which contains a data dictionary, name of each variable and its description), which you can access by running ?seattlepets in the Console or using the Help menu in RStudio to search for seattlepets.
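Besides reading the documentation, the size of the dataset can be checked directly. The sketch below uses base R functions on seattlepets, assuming the openintro package is installed.

```r
library(openintro)  # makes seattlepets available

nrow(seattlepets)   # number of rows, one per registered pet
ncol(seattlepets)   # number of variables recorded for each pet
names(seattlepets)  # variable names, as described in the data dictionary
```

These calls are safe to include in an R Markdown document, unlike View(), because they print text rather than opening a window.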
As you work through this assignment, you might encounter some issues. Here are the most common ones:
Knitting errors:
Run library(tidyverse) and library(openintro) in your Console first
Git/GitHub issues:
R Markdown issues:
General tips:
There are 52,519 pets included in the dataset. I determined this by running ?seattlepets in the Console; the documentation states that the data frame has 52,519 rows.
After completing this exercise:
🧶 Knit → ✅ Commit with message "Completed Exercise 1" → ⬆️ Push
There are 7 variables for each pet. I determined this by running ?seattlepets in the Console; the documentation states that the data frame has 7 variables.
After completing this exercise:
🧶 Knit → ✅ Commit with message "Completed Exercise 2" → ⬆️ Push
The three most common pet names in Seattle are Lucy, Charlie, and Luna.
The two lines of code can be read as “Start with the seattlepets data frame, and then count the animal_names, and display the results sorted in descending order.” The “and then” in the previous sentence maps to %>%, the pipe operator, which takes what comes before it and plugs it in as the first argument of the function that comes after it.
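The two lines being described are presumably the pipeline below. This is a sketch inferred from the description and the output, assuming tidyverse and openintro are installed. The second call is added only to illustrate what the pipe does.

```r
library(tidyverse)
library(openintro)

# Start with seattlepets, and then count animal_name, sorted in descending order
seattlepets %>%
  count(animal_name, sort = TRUE)

# The pipe passes seattlepets in as the first argument, so this call is equivalent
count(seattlepets, animal_name, sort = TRUE)
```

Both forms return the same table; the piped version reads top to bottom in the order the steps happen.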
## # A tibble: 13,930 × 2
##    animal_name     n
##    <chr>       <int>
##  1 <NA>          483
##  2 Lucy          439
##  3 Charlie       387
##  4 Luna          355
##  5 Bella         331
##  6 Max           270
##  7 Daisy         261
##  8 Molly         240
##  9 Jack          232
## 10 Lily          232
## # ℹ 13,920 more rows
Write your answer in your R Markdown document under Exercise 3. In this exercise you will not only provide a written answer but also include some code and output. You should insert the code in the code chunk provided for you, knit the document to see the output, and then write your narrative for the answer based on the output of this function, and knit again to see your narrative, code, and output in the resulting document.
After completing this exercise:
🧶 Knit → ✅ Commit with message "Completed Exercise 3" → ⬆️ Push
Let’s also look to see what the most common pet names are for various species. For this we need to first group_by() the species, and then do the same counting we did before.
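A sketch of the grouped count described above, assuming tidyverse and openintro are installed. It adds a group_by() step in front of the same count() call used earlier.

```r
library(tidyverse)
library(openintro)

# Group the rows by species, and then count names within each species
seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE)
```

Because sort = TRUE orders by n across the whole table, rows from different species end up interleaved by frequency.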
Looks like many of those NAs were cats. Poor unnamed kitties…
## # A tibble: 16,823 × 3
## # Groups:   species [4]
##    species animal_name     n
##    <chr>   <chr>       <int>
##  1 Cat     <NA>          406
##  2 Dog     Lucy          337
##  3 Dog     Charlie       306
##  4 Dog     Bella         249
##  5 Dog     Luna          244
##  6 Dog     Daisy         221
##  7 Dog     Cooper        189
##  8 Dog     Lola          187
##  9 Dog     Max           186
## 10 Dog     Molly         186
## # ℹ 16,813 more rows
But this output isn’t exactly what we wanted. We wanted to know the most common cat and dog names, but there are barely any cats present in this output! This is because there are more dogs than cats in the dataset overall. We can confirm this by counting the various species in the data.
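Counting the species could look like the sketch below, assuming tidyverse and openintro are installed. It is the same count() pattern as before, applied to species instead of animal_name.

```r
library(tidyverse)
library(openintro)

# One row per species, with the most common species first
seattlepets %>%
  count(species, sort = TRUE)
```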
6 pigs in the city? Ok… But we’ll continue with cats and dogs.
## # A tibble: 4 × 2
##   species     n
##   <chr>   <int>
## 1 Dog     35181
## 2 Cat     17294
## 3 Goat       38
## 4 Pig         6
To get the most common names within each species, we can use the slice_max() function. The first argument in the function is the variable we want to select the highest values of, which is n. The second argument is the number of rows to select, which is n = 5 for the top 5. It may be a bit confusing that both of these are n, but this is because we already have a variable called n in the data frame.
seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE) %>%
  slice_max(n, n = 5) %>%
  arrange(species, n)
## # A tibble: 53 × 3
## # Groups:   species [4]
##    species animal_name     n
##    <chr>   <chr>       <int>
##  1 Cat     Max            83
##  2 Cat     Lily           86
##  3 Cat     Lucy          102
##  4 Cat     Luna          111
##  5 Cat     <NA>          406
##  6 Dog     Daisy         221
##  7 Dog     Luna          244
##  8 Dog     Bella         249
##  9 Dog     Charlie       306
## 10 Dog     Lucy          337
## # ℹ 43 more rows
Based on the previous output we can easily identify the most common cat and dog names in Seattle, but the output is sorted by n (the frequencies) as opposed to being organized by the species. Build on the pipeline to arrange the results so that they’re arranged by species first, and then n. This means you will need to add one more step to the pipeline, and you have two options: arrange(species, n) or arrange(n, species). You should try both and decide which one organizes the output by species and then ranks the names in order of frequency for each species.
Which option groups all the cats together and all the dogs together, with names ranked by frequency within each species?
Running arrange(species, n) groups all the cats together and all the dogs together, with names ranked by frequency within each species.
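The two options can be compared side by side with a sketch like the one below, assuming tidyverse and openintro are installed. The intermediate name top_names is introduced here only for illustration.

```r
library(tidyverse)
library(openintro)

# Top names per species (ties can return more than 5 rows for a species)
top_names <- seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE) %>%
  slice_max(n, n = 5)

# Option 1: species first, then frequency within each species
top_names %>% arrange(species, n)

# Option 2: frequency first, so rows from different species are interleaved
top_names %>% arrange(n, species)
```

The order of the arguments to arrange() sets the sort priority. Wrapping n in desc() would list the most frequent name first within each species instead of last.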
After completing this exercise:
🧶 Knit → ✅ Commit with message "Completed Exercise 4" → ⬆️ Push
Tip: You don’t need to understand all the code that creates this visualization; that will come later in the course. For now, just look at the plot and answer the questions about what you observe.
After completing this exercise:
🧶 Knit → ✅ Commit with message "Completed Exercise 5" → ⬆️ Push
To submit to Canvas:
✅ Final Checkpoint: Visit your GitHub repo one more time to confirm all your work is there. We will grade what we see in your repo on GitHub!